(57) The invention relates to an automated system
for monitoring wildlife auditory data and recording
same for subsequent analysis and identification. The
system comprises one or more microphones (8)
coupled to a recording apparatus (10) for recording
wildlife vocalizations in digital format. The resultant
recorded data is preprocessed, segmented, and
analyzed by means of a neural network to identify the
respective species. The system minimizes the need
for human intervention and subjective interpretation
of the recorded sounds.




[Front-page drawing: excerpt of the identification
module; legible labels include a bird identification
module (42a, 42b) and a count of calls for each
particular species.]



Jouve. 18. rue Saint-Denis. 75001 PARIS 






EP 0 629 996 A2 






FIELD OF THE INVENTION 

This invention relates to an automated monitoring
system for monitoring wildlife auditory data, thereby
to obtain information relating to wildlife populations in
a terrestrial environment.

BACKGROUND OF THE INVENTION 

Prior to granting approval for proposed construction
or development, government authorities are requiring
more comprehensive studies on their potential impact
on the natural environment. Consequently, the
measurement of the effects of human intervention on
the natural environment, particularly on populations of
rare or endangered species of animals and on the
diversity of animal species, is an important
requirement.

Currently used wildlife monitoring schemes typically
involve experienced terrestrial biologists or surveyors
entering the environment to be monitored and
making first hand auditory and visual inspections of
the terrestrial environment. These manual inspections
may be unsatisfactory for several reasons. First,
manual inspections are labour intensive, may require
large numbers of individuals in order to cover a
sufficiently representative territory of the environment,
and may therefore be very difficult to perform; if
sufficient resources are not available the results
obtained may not be reliable.

Second, manual inspections may require lengthy
periods of time to complete. Typically, the
environments to be monitored are remote from city
centres and the travelling time to and from the site may
be substantial. At the site, surveyors must approach the
environment with great care in order to avoid
disrupting the terrestrial environment.

Third, manual inspections are limited in their
scope, because they are usually restricted to daylight
hours and, as a result, nocturnal species are not
generally observed. Animals may also be visually
obscured by forest vegetation, and some environments
such as swamps and marshes may not be easily
accessible.

Finally, the integrity of any measured data is
subject to uncertainty and error due to the highly
subjective nature of auditory and visual observations.

Automatic monitoring systems have been developed
in order to perform the monitoring function of
surveyors. These systems have typically used
conventional analog recorders to collect the animal
vocalizations, or calls, from a representative area.
However, recorders for recording long term terrestrial
data (beyond 12 hours) at multiple sites and having a
wide bandwidth (up to 10 kHz) are relatively expensive.
Furthermore, the analysis of the recordings has to be
conducted by surveyors out of the field, which
involves labour intensive analysis by biologists who are
deprived of the benefit of being present in the physical
environment when identifying the call.

SUMMARY OF THE INVENTION 

The present invention overcomes the drawbacks
associated with manual inspections by providing an
automated system which permits the continuous
recording of animal vocalizations with a minimum of
disturbance to the terrestrial environment. The present
invention also permits the monitoring of a significant
area with minimum labour requirements, and reduces
misidentification resulting from observer biases.

An automated monitoring system in accordance 
with the present invention comprises means for re- 
ceiving auditory data from the wildlife vocalizations 
being monitored, means for recording the auditory 
data in digital format, means for processing the re- 
corded data, and means for identifying predeter- 
mined characteristics of the recorded and processed 
data thereby to identify the wildlife species from 
which the vocalizations are derived. 

In order that the invention may be readily under- 
stood, preferred embodiments thereof will now be de- 
scribed, by way of example, with reference to the ac- 
companying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is an overall schematic diagram of a
wildlife vocalization identification system according
to the invention;

Figure 2 is a schematic diagram of the receiving 
and recording systems of Figure 1; 
Figure 3 is a schematic diagram of the identifica- 
tion module of Figure 1 according to one embodi- 
ment of the invention; and 
Figure 4 is a schematic diagram of the identifica- 
tion module of Figure 1 according to a second em- 
bodiment of the invention. 

DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENTS OF THE INVENTION 

Overall Scheme 

As shown in Figure 1, a receiving system 8 is
used to receive animal vocalizations from the
terrestrial environment to be monitored. The auditory
data so received is then recorded in digital format by a
recording system 10, and thereafter is analyzed by a
data analysis system 12. The data analysis system
also formats the data so that it may be processed by
an identification module 14 which can identify the
family, genus or the species of the animal from which
a call originated. In addition, the data analysis system
12 can be used to calculate population estimates of
the animals when the vocalizations are obtained from
several (at least three) locations.

The data analysis system 12 may also be equipped
with a digital-to-analog (D/A) converter 33 and a
speaker 35 in order to enable surveyors or biologists
to perform auditory identification of the signal as a
verification of the results of the identification module
14.

Automated Recorder System 


The recording system 10 is an intelligent,
multi-channel audio-band recorder. As shown in Figure 2,
this recorder consists of a conventional portable
16-bit personal computer 20 equipped with a 20 MHz
internal clock and a 510 Megabyte hard disk. An
analog-to-digital (A/D) board 22 having 16-bit resolution
is connected to the computer 20. Microphones 24,
representing the receiving system 8 of Fig. 1, may be
connected to the A/D board 22 using various
communication links so that, for example, each channel
has a bandwidth of 8 kHz, a dynamic range greater than
70 dB and a signal-to-noise ratio of 93 dB. One or
more microphones 24 may be tethered directly to the
A/D board 22, the microphones being located so as to
pick up the animal vocalizations from the selected
environment. Alternatively, or in addition, microphones
26a, 26b and 26c located remotely from the A/D board
22 communicate with the A/D board by way of a radio
frequency link. In this case the microphones 26a, 26b
and 26c are coupled to radio frequency transmitters
28a, 28b and 28c operating on separate frequency
channels. Separate channels on the A/D board 22 are
connected to radio frequency receivers 30a, 30b and
30c, which are dedicated to the radio frequency
transmitters 28a, 28b and 28c, respectively. It will be
understood by persons skilled in the art that other
techniques for transmitting signals from remote
locations to the computer 20 may be employed.
Furthermore, while the system is shown in Figure 2 as
having four microphones, it will be understood that one
or more microphones located at various locations in the
field may be used.

The acoustic data received by the microphones
24 and 26 is transmitted to the A/D board 22 where it
is converted to digital form and stored on the hard disk
associated with the computer 20. The computer 20
may be powered using a standard 120 volt AC supply,
if available, or a 12 volt DC battery.

In order to conserve the limited storage space
available on the hard disk of the computer 20, it may
be necessary to minimize the recording of extraneous
sounds, such as ambient background noise. Two
techniques have been designed to achieve this
purpose: time triggered recording and sound activated
recording. With respect to the time triggered recording
technique, the operator can predetermine the duration
of the recording time of a collection period and
the time interval between successive collection
periods. Recordings can therefore range from continuous
to discrete sampling periods. Discrete sampling periods
allow the user to monitor the environment at times
when the species being monitored are likely to be
most vocal. All of the channels may be time triggered
using the same or a different recording time and
interval between collection periods. However, when the
time triggered technique is used in association with
the preferred embodiment described below, all
channels are time triggered simultaneously.
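By way of illustration only, the time triggered technique can be sketched as a schedule of collection periods. The function name, its parameters and the example times are hypothetical and are not taken from the described embodiment; the interval is assumed here to run from the end of one collection period to the start of the next.

```python
from datetime import datetime, timedelta

def collection_periods(start, record_minutes, interval_minutes, count):
    """Return (begin, end) times for `count` collection periods.

    interval_minutes is the gap between the end of one period and the
    start of the next (an assumption); 0 gives continuous recording.
    """
    periods = []
    begin = start
    for _ in range(count):
        end = begin + timedelta(minutes=record_minutes)
        periods.append((begin, end))
        begin = end + timedelta(minutes=interval_minutes)
    return periods

# Hypothetical dawn survey: 10 minutes of recording, then a 20 minute gap.
schedule = collection_periods(datetime(2024, 5, 1, 5, 0), 10, 20, 3)
```

Setting the interval to zero yields back-to-back periods, which corresponds to the continuous end of the range mentioned above.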

The sound activated recording technique pro- 
vides a system that begins recording when triggered 
by an acoustic transient attaining a predetermined 
minimum amplitude. In the preferred embodiment, a 
transient received on one of the microphones 26 or 24 
will cause the computer 20 to record the environment 
sounds received by all of the microphones whether or 
not a triggering acoustic transient is received by 
those microphones. Once the amplitude of the envir- 
onmental sound falls below a threshold level on all 
four channels, the computer 20 ceases to record the 
environmental sounds. Between triggering tran- 
sients, the computer transfers the data from its inter- 
nal memory to the hard disk. The system records the 
time at which the triggering transient occurred in or- 
der to assist the analysis of the vocalizations. It will 
be understood by those skilled in the art, that the sys- 
tem may alternatively be designed so that the com- 
puter 20 will only record those channels receiving a 
triggering transient. In the latter case, only when the 
environmental sound received by the triggered indi- 
vidual microphone falls below a threshold will the 
computer cease recording on that channel. Alterna- 
tively, if desired, some of the channels may be time 
triggered and the others sound activated. 
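The sound activated logic of the preferred embodiment may be sketched as follows, under stated assumptions: the input is reduced to one peak amplitude per channel per frame, and the names `trigger_level` and `release_level` are illustrative stand-ins for the predetermined minimum amplitude and the threshold level. This is a sketch of the behaviour described, not the recorder's actual program.

```python
def sound_activated_frames(frames, trigger_level, release_level):
    """Return indices of the frames that would be recorded.

    frames: sequence of tuples, one peak amplitude per channel.
    Recording on all channels starts when any channel reaches
    trigger_level and stops once every channel is below release_level.
    """
    recorded = []
    recording = False
    for index, amplitudes in enumerate(frames):
        if not recording and max(amplitudes) >= trigger_level:
            recording = True       # a transient on one channel triggers all
        elif recording and all(a < release_level for a in amplitudes):
            recording = False      # every channel has fallen below threshold
        if recording:
            recorded.append(index)
    return recorded
```

The per-channel variant mentioned above would keep one `recording` flag per channel instead of a single shared flag.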

In an alternative embodiment of the invention,
the recording system 10 consists of individual analog
recorders situated in the field and interconnected so
that they commence recording using either a time
triggered technique or a sound activated technique.
The recorders may commence recording either simul- 
taneously or individually. A suitable commercially 
available time-triggered recorder is the SONY Pro- 
fessional Walkman Model WM-D6C. As mentioned 
above, the analog recorders are restricted in the 
length of real-time recording that can be made. 

In each of the above embodiments, the recorder
may be equipped with sensors 22a capable of
collecting environmental data, such as temperature,
humidity, barometric pressure and wind velocity, which
can be related to the activity level of animals. For
example, it is commonly known that amphibian
vocalizations vary with temperature and humidity levels.
The acoustic pressure level may be recorded, preferably
at 2 second intervals for each channel during the
period the computer is recording on the channel, using
either the time triggered or the sound activated
option. Information pertaining to the acoustic pressure
level may be used to estimate the relative abundance
of species, as will be discussed in more detail below.

Data Analysis 


The data analysis system 12, consisting of a
conventional personal computer and individual
programs, is used to analyze the environmental sounds
recorded by the recording system 10. As mentioned
previously, the recording system 10 may be automated,
such as the system described with reference to
Figure 2, or, alternatively, recordings may be made
using standard analog recorders. In order for the
environmental sounds to be analyzed by the
identification module 14, they must first be formatted
into digital data files. The sounds recorded by the
automated recording system, shown in Figure 2, are
stored in a digital format on the hard disk of the
computer 20. When analog recorders are used as the
recording system 10, an A/D converter, connected to the
analysis system's personal computer, is used to convert
the analog recordings to digital files.

It is useful to analyze the digital files either prior
to or while they are being processed by the
identification module 14. Commercially available
programs, for example ILS from Signal Technology Inc.,
Goleta, California, may be used to convert and compress
the digital data files obtained using either the first or
second preferred embodiment of the recording system to
a uniform size without information loss. This
conversion and compression facilitates further
processing of the data. In order to provide biologists
and scientists with an opportunity to review the details
of particular vocalizations and to verify the
identification of the vocalizations by the identification
module, three types of files may be derived from the
digital files and viewed:

(1) spectrograms, which identify the magnitude
of the various frequencies for each time sequence;

(2) audiograms, which identify the frequencies of
the time domain signal as a function of time; and

(3) time domain files, which identify the magnitude
of the signal as a function of time.

These files may be further analyzed using
commercially available programs in order to extract
pertinent statistical information. The type of
information that may be extracted from these files and
which is useful for analyzing the vocalizations includes:

(a) the average strength of all the frequency
components at a particular time as a function of time;

(b) the strength of the dominant frequency at a
particular time as a function of time; and

(c) the standard deviation for the portion of the
signal containing ambient noise and the portion
of the signal containing the call.

Additional programs to facilitate processing of the
signals by the identification module 14 may be
provided to perform the following functions:

(a) detecting the signal of interest using the
standard deviation of the signal;

(b) filtering audiograms so as to eliminate
ambient noise or separate simultaneous calls from
species having calls of different frequencies; and

(c) smoothing the audiogram and time domain
files by averaging.
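Functions (a) and (c) above might be sketched as follows. The description does not state how the standard deviation is compared or which averaging window is used, so the `factor` of 3 and the centred moving average are assumptions made for illustration only.

```python
from statistics import pstdev

def contains_call(block, noise_std, factor=3.0):
    """Flag a block of samples whose standard deviation clearly exceeds
    that of the ambient noise (factor is an assumed margin)."""
    return pstdev(block) > factor * noise_std

def moving_average(values, width):
    """Smooth a series by averaging each point with its neighbours.

    Uses a centred window that is truncated at both ends of the series.
    """
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

In practice `noise_std` would be measured from a portion of the signal known to contain only ambient noise, as item (c) of the preceding list suggests.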

Identification System 

Figure 3 illustrates the organization of the
identification module 14. It will be understood by those
skilled in the art that the configuration of the system
may be adapted to specific applications, for example,
monitoring only the vocalizations of birds, or
amphibians, or both amphibians and birds, etc.

The identification module 14 is used to
discriminate wildlife calls and to identify the animal
from which a selected call originated. Referring to
Figure 3, the digitized file 32 created by the data
analysis system 12 is provided to a segmentation module
34. The segmentation module 34 is used to determine the
commencement of a call in a vocalization. The
digitized file 32 is then provided to a feature extraction
module 36. The feature extraction module 36
generates a set of numerically quantized features of the
digitized sound segments (NQFDSS). The NQFDSS
characterize the signal according to certain features of
the signal. A prescreening module 38 is used to
eliminate extraneous vocalizations prior to the
identification stage. The NQFDSS are then provided to
individual classification modules 40. The classification
modules 40 are comprised of neural networks which
receive as inputs the NQFDSS and classify the animal
vocalizations at their outputs. As shown in Figure 3, the
classification module 40 may further classify the signal
into four sub-classification modules 42a, 42b, 42c and
42d, namely a bird identification module, an amphibian
identification module, a mammal identification module
and a module for identifying any other sound. These
sub-classification modules 42 are comprised of neural
networks. The sub-classification modules 42a, 42b,
42c and 42d classify the signals provided thereto into
the particular species and record the number of calls
44a, 44b, 44c and 44d identified for each species.

The individual components of the identification 
system will now be described in greater detail. 

Segmentation Module 

The segmentation module 34 receives the
digitized file 32 and processes the data so as to
discriminate the commencement of a call from background
noise. The processing function of the segmentation
module 34 may be performed by a program
implemented by a conventional personal computer. In the
preferred embodiment, the segmentation module 34
receives a digitized file 32 containing input points, in
which 20,000 points correspond to one second of
analog real-time sound. The segmentation module 34
scans 20,000 input points at a time and divides them
into 625 segments each containing 32 points. The
acoustical energy in each 32-point segment is
calculated and compared with a previously determined
threshold value. The threshold value for the
acoustical energy is of the same order as but larger than
the numerical representation of the acoustical energy in
a 32-point segment of the ambient background noise
at the site being monitored. The segment of ambient
background noise may be recorded by the user when
the recording system 10 is set up in the field.
Alternatively, the computer 20 described with reference
to the embodiment of Figure 2 may be programmed to
record the ambient noise level at regular intervals in
order to account for changes in weather conditions or
noise caused by animal and/or human activity.

The program searches the 625 segments in order
to locate five contiguous segments which exceed the
threshold value of the acoustical energy. If all the 625
segments have an acoustical energy greater than the
threshold, the entire set of 20,000 points is forwarded
to the feature extraction module 36. Otherwise, when
five contiguous segments have been located, the
beginning of the first segment of the first five contiguous
segments that exceed this threshold is identified as
the beginning of the call, to be forwarded to the
feature extraction module 36. Once five contiguous
segments are located, 20,000 points beginning with the
first contiguous segment are provided to the feature
extraction module 36.
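The search for five contiguous segments can be sketched as below. The description does not define "acoustical energy" numerically; the sum of squared sample values is used here as an assumption, and the function names are illustrative only.

```python
SEGMENT_POINTS = 32   # points per segment (20,000 // 32 = 625 segments)
RUN_LENGTH = 5        # contiguous segments required to declare a call

def segment_energy(points):
    """Sum of squared sample values (an assumed definition of energy)."""
    return sum(p * p for p in points)

def find_call_start(points, threshold):
    """Return the index of the first sample of the call, or None.

    The call begins at the start of the first run of RUN_LENGTH
    contiguous segments whose energy exceeds the threshold.
    """
    n_segments = len(points) // SEGMENT_POINTS
    run = 0
    for s in range(n_segments):
        chunk = points[s * SEGMENT_POINTS:(s + 1) * SEGMENT_POINTS]
        if segment_energy(chunk) > threshold:
            run += 1
            if run == RUN_LENGTH:
                return (s - RUN_LENGTH + 1) * SEGMENT_POINTS
        else:
            run = 0   # the run was broken; start counting again
    return None
```

When every segment exceeds the threshold the function returns index 0, so the entire set of points is forwarded, which agrees with the first case described above.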

If none of the 32-point segments has an
acoustical energy greater than the threshold value, a new
threshold is determined by taking 2% of the acoustical
energy in that one of the 625 segments which contains
the maximum amount of acoustical energy. The
segmentation module 34 then repeats the same procedure
on the 625 segments containing the maximum amount of
acoustical energy.

It will be understood by those skilled in the art that
the specific lengths of time, the number of points in
each segment and the number of segments used for
decision criteria and other values are exemplary only
for purposes of description of the preferred
embodiment and that these values are not necessarily
appropriate for all monitoring situations. Furthermore,
persons skilled in the art will also understand that the
present invention is not limited to the segmentation
procedure described above and contemplates any
procedure which is capable of isolating a call.


Feature Extraction Module 

The feature extraction module 36 produces a set
of coefficients, referred to as NQFDSS, which
characterize a particular characteristic of the digitized
file 32. The feature extraction module 36 receives the
digitized file 32 from the segmentation module 34.
The feature extraction module 36 may characterize
the digitized file 32 in several ways, for example using
mel bins, cepstrum coefficients, linear predictive
coefficients or correlation coefficients. The feature
extraction module 36 only processes the first 11,264
points (352 segments) from the beginning of the
vocalization identified by the segmentation module 34.
The 11,264 points are processed 2,048 points at a
time with a 1,024 point overlap. The first set of points
starts at the first point and ends with point 2,048.
These points are isolated by means of a Welch
window, the procedure for which is described in Press, et
al. (Numerical Recipes in C, Cambridge University
Press, 1988), the contents of which are hereinafter
incorporated by reference. If the feature extraction
module 36 characterizes the digitized file 32 using
mel bins, the Fast Fourier transform (FFT) and the
power spectrum of the windowed set of points are
calculated. The frequency axis is then divided into 18
segments each 20 mels in width. The 18 areas in
these 18 segments of the power spectrum are
extracted and saved.

The second set of 2,048 points starts with point 
1,025 and ends with point 3,072. These points are 
processed according to the same procedure for the 
first set and a further 18 numbers are extracted. This 
procedure is continued until all of the 11,264 points 
have been processed (10 sets of 2,048 points) and 
180 power spectrum numbers grouped into mel bins 
have been extracted. 
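The framing arithmetic may be checked with a short sketch. The point numbers follow the description (1-based, 2,048 points per set, 1,024 point overlap); the window shown is one common form of the Welch (parabolic) window and is an assumption, since the exact expression is given only in the cited reference. The FFT and mel binning steps are not shown.

```python
FRAME = 2048     # points per set
HOP = 1024       # advance between sets (2,048 points, 1,024 overlap)
TOTAL = 11264    # points processed from the start of the call
BINS = 18        # mel bins extracted per set

def frame_bounds(total=TOTAL, frame=FRAME, hop=HOP):
    """1-based (first, last) point numbers of each overlapping set."""
    return [(start + 1, start + frame)
            for start in range(0, total - frame + 1, hop)]

def welch_window(n):
    """One common form of the Welch (parabolic) window of length n."""
    half = 0.5 * (n - 1)
    denom = 0.5 * (n + 1)
    return [1.0 - ((j - half) / denom) ** 2 for j in range(n)]

def apply_window(points, window):
    """Multiply a set of points by the window, sample by sample."""
    return [p * w for p, w in zip(points, window)]

bounds = frame_bounds()
feature_count = len(bounds) * BINS
```

The ten sets produced by `frame_bounds` multiplied by 18 bins per set give the 180 values that make up the NQFDSS in the mel bin case.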

Alternatively, a feature extraction module 36 may
be used which characterizes the digitized file 32
using cepstrum coefficients. In order to characterize
using cepstrum coefficients, the feature extraction
module also uses the first 11,264 points of the
vocalization, which are divided into 10 overlapping sets
of 2,048 points each. Each of the sets is windowed
using a Welch window. Each of the sets is processed so
as to produce 24 cepstrum coefficients for each set.
Cepstrum coefficients are used to characterize the
speaker's vocal cord characteristics. The procedure
for deriving the cepstrum coefficients for each set is
found in S. Furui, Cepstral Analysis Technique for
Automatic Speaker Verification, IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol.
ASSP-29, No. 2, April 1981, pages 254-272, the
contents of which are hereinafter incorporated by
reference. The NQFDSS for the second module are
composed of 240 cepstrum coefficients (10 sets of 24).

The procedures for characterizing the digitized
file 32 using either linear predictive coefficients or
correlation coefficients are described in S. Furui,
Cepstral Analysis Technique for Automatic Speaker
Verification, IEEE Transactions on Acoustics,
Speech and Signal Processing, Vol. ASSP-29, No. 2,
April 1981, pages 254-272. It will be understood by a
person skilled in the art that there are additional
alternative ways of characterizing the digitized file 32.


Prescreening Module 

The NQFDSS derived by the feature extraction
module 36 are provided to a prescreening module 38.
The prescreening module 38 is designed to screen
out extraneous vocalizations prior to the
identification stage. This screening out process improves
the reliability and efficiency of the identification stage.

The design process for the prescreening module
38 is similar to an unsupervised clustering process
based on Euclidean distance, which is described in Y.
Pao, Adaptive Pattern Recognition and Neural
Networks, Addison-Wesley, 1989, the contents of which
are hereinafter incorporated by reference. The
NQFDSS of a set of identified training sample calls
are obtained from the feature extraction module 36.
The NQFDSS are then normalized in the range 0 to
1. The training samples are then processed to
determine the clusters according to the following
sequence of steps:

1. n = 1

2. cluster 1 = training sample 1

3. Increment n

4. Stop if n > number of training samples

5. Find cluster i that is closest to training sample n.
If the Euclidean distance is greater than a threshold
(for example, 2.5), create a new cluster i + 1
= sample n. Else, add sample n to cluster i and
adjust cluster i such that it is at the centroid of all the
samples in that cluster.

6. Go to step 3.

Once the prescreening module 38 has been
designed, any sample that is outside the specified
distance from all the clusters is termed "unknown" and
is discarded prior to the identification stage.
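A minimal sketch of the cluster design steps and of the subsequent screening follows. The function names are hypothetical, each cluster is represented by its centroid, and a new cluster is simply appended to the list of clusters; the default threshold of 2.5 is the example value given above.

```python
import math

def design_clusters(samples, threshold=2.5):
    """Build cluster centroids from normalized training samples."""
    centroids = [list(samples[0])]       # cluster 1 = training sample 1
    members = [[samples[0]]]
    for sample in samples[1:]:           # steps 3 to 6
        distances = [math.dist(sample, c) for c in centroids]
        i = min(range(len(centroids)), key=distances.__getitem__)
        if distances[i] > threshold:
            centroids.append(list(sample))   # too far: start a new cluster
            members.append([sample])
        else:
            members[i].append(sample)        # move cluster i to its centroid
            count = len(members[i])
            centroids[i] = [sum(col) / count for col in zip(*members[i])]
    return centroids

def is_unknown(sample, centroids, threshold=2.5):
    """True when the sample lies outside the distance of every cluster."""
    return all(math.dist(sample, c) > threshold for c in centroids)
```

Samples for which `is_unknown` returns True would be discarded before classification; all others pass to the classification module.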

Classification Module 

The classification module consists of a
multilayer, fully connected, feedforward perceptron type
of neural network such as described in McClelland J.L.
et al., Parallel Distributed Processing, Vol. 1, MIT
Press, 1986, the contents of which are hereinafter
incorporated by reference. In the preferred
embodiment, each neural network consists of an input
layer, an output layer and a single hidden layer. The
number of neurons in the input layer corresponds to the
number of NQFDSS provided by the feature extraction
module 36. Accordingly, 180 neurons are used in the
input layer of the classification module 40 when the
segmented signal is characterized by the feature
extraction module 36 using mel bins and 240 neurons
are used in the input layer when the segmented signal
is characterized using cepstrum coefficients. The
number of output neurons in the output layer
corresponds to the number of possible categories into
which the segmented signal, or vocalization, may be
classified. The classification module in Figure 3
would have four neurons in the output layer
corresponding to the four possible classifications. The
number of hidden layers and the number of neurons
in each hidden layer is determined empirically in order
to maximize performance on specific types of sounds
being monitored. However, for a particular application
of the system described, one hidden layer containing
20 neurons is used. In each layer the neurons provide
outputs between 0 and 1.
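For illustration, a forward pass through a network of the stated dimensions (180 inputs, 20 hidden neurons, 4 outputs) might look as follows. The logistic activation, the bias terms and the untrained random weights are assumptions chosen so that each neuron's output lies between 0 and 1; the description does not name the activation function.

```python
import math
import random

def sigmoid(x):
    """Logistic function; output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Output of one fully connected layer of neurons."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def make_weights(n_out, n_in, rng, scale=0.3):
    """Small random weights (placeholder values, not trained ones)."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

def forward(features, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> single hidden layer -> output layer."""
    hidden = layer(features, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

rng = random.Random(0)
hidden_w, hidden_b = make_weights(20, 180, rng), [0.0] * 20
output_w, output_b = make_weights(4, 20, rng), [0.0] * 4
outputs = forward([0.5] * 180, hidden_w, hidden_b, output_w, output_b)
```

With cepstrum coefficients the only change would be an input width of 240 instead of 180.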

In the present embodiment, in the training phase
of the neural network, the interconnection strengths
between neurons are randomly set to values between
+0.3 and -0.3. A test digitized file is provided to the
segmentation module 34. The NQFDSS derived from
the feature extraction module 36 are normalized to
values between 0 and 2 and are used as inputs to
each of the input neurons. The network is trained
using a backpropagation algorithm such as described
in the afore-mentioned reference of McClelland et al.
The learning rate and momentum factor are set to 0.01
and 0.6, respectively. The training process is carried
out until the maximum error on the training sample
reaches 0.20 or less (the maximum possible error
being 1.0). In other words, training continues until the
activation of the correct output neuron is at least 0.8
and the activation of the incorrect output neurons is
no more than 0.2.
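The stopping rule can be expressed as a small check on the network outputs for the training samples. This sketch covers only the convergence test, not the backpropagation update itself, and the function name is hypothetical.

```python
def training_converged(outputs, targets, max_error=0.20):
    """True when no output neuron errs by more than max_error.

    outputs, targets: one list of output-neuron values per training
    sample; each target is 1.0 for the correct neuron and 0.0 otherwise.
    """
    worst = max(abs(o - t)
                for out, tgt in zip(outputs, targets)
                for o, t in zip(out, tgt))
    return worst <= max_error
```

A target of 1.0 with a maximum error of 0.20 corresponds to an activation of at least 0.8 on the correct neuron, as stated above.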

In order to classify a call, the NQFDSS for the call
are provided to the input neurons of the classification
module 40. Only when the responses of all the output
neurons are less than a specified value V1 (for
example, where V1 is less than 0.5) with the exception of
one output neuron whose response is above a
specified value V2 (for example, where V2 is greater than
0.5) is the network deemed to have made a
classification of the call into the family 42 corresponding
to the output neuron having the output greater than V2.
When a response other than the foregoing is
obtained, the classification network is deemed to be
undecided.
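The decision rule may be sketched as follows. The default values of 0.4 for V1 and 0.6 for V2 are hypothetical examples consistent with the ranges given; an undecided result is represented by `None`.

```python
def classify(outputs, v1=0.4, v2=0.6):
    """Return the index of the winning output neuron, or None if undecided.

    Exactly one neuron must respond above v2 and every other neuron
    must respond below v1.
    """
    winners = [i for i, o in enumerate(outputs) if o > v2]
    if len(winners) != 1:
        return None
    others = [o for i, o in enumerate(outputs) if i != winners[0]]
    return winners[0] if all(o < v1 for o in others) else None
```

For example, outputs of 0.1, 0.9, 0.2 and 0.05 select the second family, whereas two strong responses, or a single response that is only moderate, leave the network undecided.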

It may be desirable to classify further the
NQFDSS into the species of the animal from which the
call originated. As shown in Figure 3, each family
identification module 42a, 42b, 42c and 42d may be
further classified to determine the number of calls
recorded which originated from a particular species 44.
Each of the identification modules 42a, 42b, 42c and
42d includes a neural network similar to the neural
networks described with reference to the classification
module 40. However, the number of neurons in the
output layer of the identification modules 42a, 42b,
42c and 42d would correspond to the number of
species to be identified by each family identification
module 42. When the classification module 40
identifies one of the family identification modules 42a,
42b, 42c or 42d, the NQFDSS are then provided to the
input layer of the identified family classification
module. In the identified classification module, the
neuron in the output layer corresponding to the animal
species which originated the call will respond with an
output greater than V2.

It will be apparent to a person skilled in the art that
further classification schemes are possible in order to
more efficiently and reliably classify vocalizations.
For example, the bird identification module may
correctly identify the majority of vocalizations but
inconsistently identify a subgroup of birds which
contains species A, B, C and D. The network may be
retrained to place calls of all birds of this subgroup into
a single identification category and forward the
appropriate set of NQFDSS to another identification
module which has been specifically trained to identify
species of this subgroup from calls presented to it from
this subgroup only.

In order to improve the reliability of the
identification module 14, a system which combines the
outputs of two or more neural networks analyzing the
same signal has been designed. As shown in Figure
4, the digitized file 32 is segmented by the
segmentation module 34 and the digitized file 32 is then
provided to two feature extraction modules 46a and 46b.
Each feature extraction module 46a and 46b
characterizes the digitized file 32 using a different
technique, for example mel bins and cepstrum
coefficients. The NQFDSS generated by each feature
extraction module 46a and 46b are respectively provided
to a prescreening module 48a and 48b and classification
module 50a and 50b. The feature extraction modules
46a and 46b, prescreening modules 48a and 48b, and
classification modules 50a and 50b function in the
same way as the equivalent modules in Figure 3.

In a system having two classification modules
50a and 50b, such as shown in Figure 4, there are four
possible results: (a) both classification modules 50a
and 50b make the same classification; (b) one
module makes a definite classification while the other
module is undecided; (c) both modules are undecided;
or (d) the modules make conflicting classifications.
A combine module 52 is used to rationalize
the possible outputs from the classification modules
50a and 50b. There are several techniques that may
be used by the combining module 52 for combining
the results from each module 50a and 50b and thereby
improving the efficiency of the results obtained by
either individually. According to one such technique,
when result (a) is obtained, the classification made by
both modules is accepted as correct; when result (b)
is obtained, the definite classification is accepted as
correct; and when result (c) or (d) is obtained, the
particular sound is tagged for possible future review
by a human auditor and no conclusion is reached.
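
The rule for results (a) to (d) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: an undecided module is represented by `None`, and the function name `combine` and the returned review flag are illustrative choices, not part of the specification.

```python
from typing import Optional, Tuple


def combine(
    a: Optional[str], b: Optional[str]
) -> Tuple[Optional[str], bool]:
    """Combine two module outputs.

    Each argument is a species label, or None when that module is
    undecided (an assumed encoding). Returns (classification,
    needs_review), where needs_review marks the sound for possible
    later review by a human auditor.
    """
    if a is not None and a == b:
        return a, False  # result (a): both modules agree
    if a is not None and b is None:
        return a, False  # result (b): only the first module is definite
    if a is None and b is not None:
        return b, False  # result (b): only the second module is definite
    # result (c): both undecided, or result (d): conflicting labels
    return None, True
```

For example, `combine("robin", None)` accepts "robin", while `combine("robin", "wren")` reaches no conclusion and sets the review flag.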

An alternative technique that may be used by the
combining module 52 involves averaging the response
obtained for each output neuron of one classification
module with the response of the corresponding output
neuron of the other classification module(s). This
technique is most suitable when neuron responses range
between 0.3 and 0.7 and where there are more than two
classification modules to be combined.
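
The averaging technique can be sketched in Python as follows. This is an illustrative sketch only: the specification does not state how the averaged responses are turned into a decision, so the acceptance threshold of 0.5 and the function names are assumptions.

```python
from typing import List, Optional, Sequence


def average_outputs(responses: Sequence[Sequence[float]]) -> List[float]:
    """Average the response of each output neuron across modules.

    responses[m][k] is the response of output neuron k of module m.
    """
    if not responses:
        raise ValueError("at least one classification module is required")
    width = len(responses[0])
    if any(len(r) != width for r in responses):
        raise ValueError("all modules must have the same number of outputs")
    return [sum(column) / len(responses) for column in zip(*responses)]


def pick_class(averaged: Sequence[float],
               threshold: float = 0.5) -> Optional[int]:
    """Return the index of the strongest averaged neuron, or None if it
    falls below the threshold (an assumed decision rule)."""
    best = max(range(len(averaged)), key=lambda k: averaged[k])
    return best if averaged[best] >= threshold else None
```

With three modules responding `[0.6, 0.3]`, `[0.7, 0.4]` and `[0.5, 0.2]`, the averaged responses are approximately `[0.6, 0.3]`, so the first classification would be selected under the assumed threshold.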

The classification derived by the combine module
52 may be further subclassified, for example as
shown in Figure 3.

In another aspect, the present invention also pro- 
vides for the estimation of the relative abundance of 
species in a terrestrial environment. A multi-channel 
recording system is used to record environmental
sounds containing vocalizations. The microphones
associated with the recorder are positioned at three 
or more discrete sampling locations in a triangular 
scheme whereby a vocalization in the vicinity of one 
microphone can also be detected by the remaining 
microphones. The microphones record simultaneous- 
ly using either the time triggered technique or the 
voice activated technique. The amplitude of sound re- 
corded by each of the microphones will vary depend- 
ing on the position of the microphone with respect to 
the animal or animals from which the call or calls
originated.

It will be appreciated that the specific embodi- 
ments described above can be varied in numerous 
ways to suit particular requirements without depart- 
ing from the scope of the invention. For example, the 
system can be modified for use in monitoring wildlife 
auditory data in an aquatic environment, hydro- 
phones being used in place of microphones. 

Claims 

1. An automated monitoring system for monitoring 
wildlife vocalizations comprising: 

means (8) for receiving auditory data from 
said vocalizations,

means (10) for recording the auditory data 
in digital format, 

means (12) for processing the recorded
data, and 

means (14) for identifying predetermined 
characteristics of the recorded and processed 
data thereby to identify the wildlife species from 
which the vocalizations are derived. 

2. An automated monitoring system according to 
claim 1, wherein said receiving means (8) com- 
prise at least one microphone tethered directly to 
the recording means (10).

3. An automated monitoring system, wherein said
receiving means (8) comprise at least one microphone
(26a) coupled to the recording means (10)
by a radiofrequency link.

4. An automated monitoring system according to 
claim 1, wherein said receiving means (8) com- 
prise at least three microphones (26a, 26b, 26c) 
located at different locations and coupled to the 
recording means (10) by radiofrequency links
defining respective receiving channels.

5. An automated monitoring system according to 
claim 1, wherein said recording means (10) comprise
an analog-to-digital converter (22) connected
to a personal computer (20).

6. An automated monitoring system according to 
claim 1, wherein said recording means (10) is 
time-triggered to record the auditory data at successive
discrete intervals of equal duration.

7. An automated monitoring system according to 
claim 1, wherein said recording means (10) is 
sound-activated to record the auditory data at
successive discrete intervals of equal duration. 

8. An automated monitoring system according to 
claim 1, further comprising means (22a) for sensing
environmental data, said recording means being
adapted to record the environmental data in
digital format. 

9. An automated monitoring system according to 
claim 8, wherein said sensing means (22a) are
temperature, barometric pressure and wind ve- 
locity sensors. 

10. An automated monitoring system according to
claim 1, wherein said processing means for
processing the recorded data comprises means
for formatting the data into digital data files and
means for compressing the digital data files to a
uniform size without information loss.

11. An automated monitoring system according to 
claim 10, wherein said processing means further
comprises means for deriving from the recorded
data spectrograms, audiograms and time domain
representations.

12. An automated monitoring system according to
claim 11, wherein said processing means further
comprises means for deriving from the recorded
data

(i) the average strength of all the frequency
components at a particular time as a function
of time,

(ii) the strength of the dominant frequency at
a particular time as a function of time, and

(iii) the standard deviation for the portion of
the auditory data containing noise and the
portion of the auditory data representing the
vocalization.

13. An automated monitoring system according to 
claim 11, wherein said processing means further 
comprises: 

(a) means for detecting a vocalization using 
the standard deviation of said digitally record- 
ed received sounds; 

(b) means for filtering said audiogram so as to 
eliminate ambient noise or separate simulta- 
neous vocalizations from species having vo- 
calizations of different frequencies; and 

(c) means for smoothing said audiogram and 
said time domain representation by averag- 
ing. 

14. An automated monitoring system according to 
claim 1, wherein said processing means includes
a digital-to-analog converter (33), for converting 
said digitally recorded received sounds to analog 
form, connected to a speaker (35). 

15. An automated monitoring system according to 
claim 1, wherein said identifying means (14) 
comprises: 

(a) means for determining the commence- 
ment of a vocalization in said recorded audi- 
tory data, 

(b) means for characterizing a received vocal- 
ization into a set of numerically quantized fea- 
tures of said vocalization using a characteriz- 
ing technique; and 

(c) means (40) for classifying the vocalization 
according to said numerically quantized fea- 
tures. 

16. An automated monitoring system according to 
claim 15, wherein said means for determining the
commencement of said vocalization performs the 
following steps: 

(a) dividing the digitally recorded data into a 
plurality of segments; 

(b) calculating the acoustical energy of each
segment; 

(c) comparing the acoustical energy of each 
segment with a predetermined threshold val- 
ue; and 

(d) locating a predetermined number of con- 
tiguous segments whose acoustical energy 
exceeds said threshold value.

17. An automated monitoring system according to 
claim 15, wherein said characterizing means
characterizes said recorded auditory data using
mel bins, cepstrum coefficients, linear predictive 
coefficients or correlation coefficients. 

18. An automated monitoring system according to
claim 15, wherein said classifying means com-
prises a neural network. 

19. An automated monitoring system according to 
claim 18, wherein the neural network is a multilayer
fully connected feed forward perceptron
type having an input layer, an output layer and at 
least one hidden layer. 

20. An automated monitoring system according to
claim 19, wherein the number of neurons
in said input layer corresponds to the number of
numerically quantized features in said set and the
number of neurons in the output layer corresponds
to the number of possible classifications
for said vocalization.

21. An automated monitoring system according to 
claim 20, wherein the activation of each output 
neuron in said output layer identifies a further
sub-classification neural network which has an 
input layer which receives said numerically quan- 
tized features and an output layer having a neu- 
ron for each possible classification for said
vocalization.

22. An automated monitoring system according to 
claim 21, further comprising a successive classification
neural network for use in further classifying
said vocalization.
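
By way of illustration only, and not forming part of the claims, steps (a) to (d) of claim 16 can be sketched in Python as follows. The claims do not define how acoustical energy is computed, so taking it as the sum of squared sample values is an assumption, as are the function and parameter names.

```python
from typing import Optional, Sequence


def find_onset(
    samples: Sequence[float],
    segment_len: int,
    threshold: float,
    min_run: int,
) -> Optional[int]:
    """Return the sample index at which a vocalization is taken to
    commence, or None if no qualifying run of segments is found."""
    if segment_len <= 0 or min_run <= 0:
        raise ValueError("segment_len and min_run must be positive")

    run = 0
    # (a) divide the recorded data into equal, non-overlapping segments
    for i in range(len(samples) // segment_len):
        segment = samples[i * segment_len:(i + 1) * segment_len]
        # (b) energy of the segment (assumed: sum of squared samples)
        energy = sum(x * x for x in segment)
        # (c) compare the energy with the predetermined threshold
        if energy > threshold:
            run += 1
            # (d) a predetermined number of contiguous segments found
            if run == min_run:
                return (i - min_run + 1) * segment_len
        else:
            run = 0
    return None
```

Requiring several contiguous segments above the threshold keeps a single brief noise burst from being treated as the start of a call.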